Sung-Hyun Ryu1, Sang-Hyun Choi2*
1,2Dept. of Management Information Systems, Chungbuk National University, Korea
*Corresponding Author E-mail: rsh1451@cbnu.ac.kr1, Chois@cbnu.ac.kr2
ABSTRACT:
Background/Objectives: With the advent of Web 2.0, low-quality information is overflowing. Because healthcare is directly connected to human life, it is important to manage and assess medical information.
Methods/Statistical analysis: In this paper, we analyzed and classified answers on Naver Knowledge iN, a Q&A community site, using a list of medical nouns so that knowledge consumers can obtain high-quality information. We gathered 784 questions and 1,542 answers.
Findings: The classification accuracy was 46% for Naïve Bayes and 60% for the matching-status method. This shows that the lung cancer keyword library we developed could be used to filter out worthless answers whenever a knowledge consumer wants useful medical information. Our keyword library differs from existing medical libraries in that it considers the keywords that both non-experts and experts use about medical information.
Improvements/Applications: The experimental results contribute to filtering healthcare information for people who are easily misled by wrong medical information.
KEYWORDS: Lung cancer, text mining, medical term, healthcare, answer filtering.
1. INTRODUCTION:
With the development of the Internet and the sharing of personal information via the various platforms of the Web 2.0 era, users can now send and receive more information more easily. Accordingly, the amount of information from knowledge providers has increased exponentially, and information overload has emerged. Schultz and Vandenbosch define information overload as a state in which the amount of information is too large to be processed1. Contrary to the general idea that more information is better, when the amount of information exceeds what an individual can process, the result is deteriorated information processing ability, information fatigue syndrome, and other negative consequences.
Also, in the era of big data, the concept of information has changed. In the past, "information" was a comprehensive concept delivered to people in general. Today it has evolved concretely and multilaterally into customized questions and responses for individuals, and a large amount of information overflows as a result. Because all users can share a wide variety of information, new problems of quality management and information errors have been raised. Providers and users therefore make an effort to provide highly reliable information, but amid the overflow it is difficult to select information that meets the needs of knowledge consumers.
It is generally accepted that possessing knowledge is necessary for competitiveness in an information society. As knowledge gains importance, intelligence, as the entity that produces knowledge, is also regarded as important2. Stenger explained that creating knowledge was regarded as a role reserved for so-called "experts" who can utilize accumulated knowledge, not something anyone can do3. In modern society, however, open educational standards and collaborative spaces enabled by information technology have been opened to the public, making critical discussion among people possible. At the core of this discussion, the concept of collective intelligence began to draw attention2.
Many online question-and-answer communities have emerged worldwide in the information sharing market. In the 2009 ranking of Korean knowledge search sites by share, NAVER Knowledge iN was preferred by a remarkable margin over DAUM Agora4. The biggest advantage of NAVER Knowledge iN is that anyone can participate and anyone can access it, but its disadvantage is that accuracy and reliability cannot be guaranteed. To compensate for this, the service was modified so that experts as well as netizens participate in answering, but many knowledge consumers are still disappointed because the answers provided are not accurate.
On the Internet, netizens have high distrust of health information, and they cite the spread of unfounded medical information as a major reason. In addition, the new term "cyberchondria" has been coined for patients who diagnose and prescribe for themselves inexactly by searching Internet medical information, which shows that unverified medical information is a problem in current society. Kim Tae Yun pointed out that a general knowledge provider may present a scientific basis first and place advertising in the latter part of an answer, so consumers can easily be misled5.
As interest in health care has increased and much relevant medical knowledge is put to important use, reliable knowledge is needed that helps consumers obtain the necessary level of information. Therefore, we seek a way to reduce the cost for knowledge consumers of processing and utilizing the growing volume of information on various information sharing platforms. With the arrival of the era of customized consumption, this study examines whether the keywords of answers in medical knowledge search services can provide consumers with the necessary level of information, rather than concentrating on advertisement filtering as earlier studies did.
This study is meaningful in that it grasps the distribution of answers on Knowledge iN, grasps how consumers consume medical knowledge, and complements the limits of collective intelligence. Collective intelligence has regained value with the emergence of Web 2.0. It has come to be used as a concept rather than as an application-level approach, and the idea that knowledge generated in a group is more useful than individual knowledge is a methodology that can guide changes in the web environment6.
Research on the concept of collective intelligence was led by Levy, who insisted that all actions in which multiple netizens share what they know with each other and produce multiple answers to each question can be regarded as collective intelligence7. In collective intelligence, "collective" refers not to experts but to ordinary people who share the knowledge they have experienced and jointly produce knowledge2.
There are several studies on Web 2.0 and collective intelligence. Lee Jong Chel selected Wikipedia, where knowledge can be searched, as a representative example of "collective intelligence" and demonstrated the limits, in the form of information errors, of the principle of free collaboration8.
Knowledge iN, the No. 1 knowledge sharing community site in Korea, is a representative example of collective intelligence that is expected to become more active with the development of digital literacy. Although it is evaluated as a creative platform of the new Internet era, "accuracy" and "reliability" are emerging as big problems as knowledge piles up8,9.
Research on communities representing collective intelligence reveals the severity of information errors in the process by which knowledge consumers communicate and share knowledge with each other. According to Schultz and Vandenbosch, information overload means a state in which the amount of information exceeds what can be processed1. The concept originated in psychology, where it was not a quantitative concept but a behavior-centered one concerning the effect of a huge amount of information on humans. Miller asserted that a vast amount of information may lower information processing capacity or make choosing difficult, and this became the basis of theories such as cognitive surplus, knowledge excess, and information fatigue10. With the further development of the Internet, social problems at the level of privacy also emerged10. Other studies report negative consequences, such as psychological anxiety and the various physical disorders it causes, as the information environment changes11.
Lee Hwan Soo, Lim Dong Won, and Zo Hang Jung extended the theories of information overload and user resistance and adapted them to the context of information systems10. However, their study is limited in that it analyzed only the dimension of excessive information without considering individual characteristics.
In this paper, through empirical research in an age of big data and information overload, we aim to support human decision making by offering information customized to knowledge consumers, and thereby to mitigate inefficient knowledge consumption behavior.
There are several studies on the reliability of knowledge sharing sites. Park Ju Bum and Jun Dong Yol tried to grasp the expertise and accuracy of such sites, but the 18 items they examined are difficult to take as representative12. Park Sun Jin analyzed how far answers were based on facts, focusing on elementary school science knowledge, and found that most answers were based on respondents' experiences and Internet resources13. Many of these answers need to be corrected and supplemented, and questioners who adopt unverified answers are sometimes confused and run into problems, so sources must be stated clearly. It was also proposed that capabilities and institutions are needed to selectively accept scientific knowledge at the national or individual level.
There has been little research that directly examined Internet knowledge search sites. However, Lee Jong Chel and Oh Jin A analyzed the actual state of information errors by directly using knowledge information from NAVER and DAUM8. Kim Tae Yun proposed a method by which questions and answer types can be identified via qualitative content analysis and health-related answers can be automatically classified into information and advertisements5. However, this classification covers only a very narrow area within health care and medical knowledge, and the advertisement filtering method must be extended to other areas and verified anew.
2. METHOD OF ANALYZING THE VALIDITY OF MEDICAL INFORMATION:
2.1. Data:
According to previous studies, information without medical grounds has been widely used without any proper filtering process. Naver Knowledge Search, which ranked at the top in daily average visitors and share per hour as of 2009, was considered a representative source of health data. Therefore, a total of 1,542 answers to health care questions were collected from Naver Knowledge Search, and their contents were analyzed.
The Naver Knowledge Search results and the collected data set are presented in Table 1.
To make the Naver data easy to process and modify, this study stored it in Excel and added new variables, including answer and question categories, to the data offered by default, such as question, answer, author, and date. The data were collected from August 1 to 10, 2015, and comprised 784 questions and 1,542 answers.
In preliminary research, the researchers cooperated with medical staff to choose detailed categories for health questions and answers. Because one question often had more than one answer, multiple answers were allowed. If a question falls under a category, it is marked with '1'; otherwise it is marked with '0'. The definitions of the answer categories are presented in Table 2, and classification was performed on that basis.
| Answer Categorization | Definition |
|---|---|
| Cure | Answers covering any treatment, including drugs, radiation, and surgery, and the definition of treatments |
| Symptom | Symptoms in the early stage of illness, symptoms at each stage, pain, and symptoms and phenomena experienced during treatment |
| Diagnosis | Answers in which the respondent arbitrarily judges and explains the questioner's symptoms |
| Oriental medicine | Answers that recommend oriental medicine treatment or a visit to an oriental medicine clinic, regardless of whether medical grounds exist |
| Advice | Answers in which the respondent empathizes with the questioner or offers guidance on treatment based on his or her own experience |
| Religion | Answers with no scientific basis for treating the disease that promote miracle cures or a particular religion |
| Management | Food that is good for the patient, exercise, and medical care |
| Etc | Answers on a theme that does not address the question at all, or cases related to industrial accidents, etc. |
| Advertisement | Statements promoting a specific product or directing readers to a specific home page |
To build a new data set, this study divided the data into an information area and a non-information area, as shown in Figure 1. The information area includes management, treatment, symptoms, diagnosis, and others, which are judged to be usable as information by knowledge users. The non-information area includes oriental medicine, religion, food, and advertisement, which can act as wrong information.
In the existing answer categories5, advice was too obscure to judge and applied to few answers. It was therefore judged not to serve the purpose of this study, which is to deliver accurate medical information. Advice was excluded, and a new category was chosen.
The 'management' category of previous research was divided into 'food' and 'care' (e.g., exercise methods), and categories 7 and 8 were created additionally, as shown in Table 3. In the 'number' category 1, each answer category was assigned a number from 1 to 10 to make further analysis easier. From the classified data set, the variables that met the purpose of this study were chosen, as shown in the figure below, and were used in further analysis.
2.2. Research Model:
The answer data of the Knowledge Search community were classified by answer category. All answers other than those in advertisement, oriental medicine, religion, and food, which had no medical grounds, were used to build a dictionary. Finally, a mechanism to deliver information to knowledge consumers was designed using keyword frequency.
2.2.1. Online Q&A Community:
Naver Knowledge iN has grown into Korea's top online knowledge sharing community. On Dec. 17, 2016 alone, 22,188 questions and 35,556 answers were newly created.
In the Naver Knowledge Search system, when a user asks a question, other users and experts answer it. Through repetition of this process, collective intelligence has been achieved and has offered useful benefits to many knowledge consumers. Collective intelligence is evaluated as an innovative model for building knowledge, but it faces many problems, such as a lack of expertise, accuracy, and reliability.
Therefore, the researchers use health-related Q&A content, among many areas, to examine how usable such content is.
2.2.2. 1st Filtering:
According to previous studies, the reliability of medical answers greatly influences knowledge consumers, and these studies emphasized the importance of filtering essential information. Therefore, this study used the categories that medical staff matched to the answers as the first filtering variable and removed the answers in categories judged not to provide medical information. Of the total 1,542 answers, those in the categories of advertisement, religion, and food, which were judged unusable as information for knowledge consumers, were filtered out. The largest number of removed answers, 921, came from the advertisement category, which indicates that a large share of the medical answers carried no useful information.
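The tagging and first filtering step can be sketched in Python as follows. This is only an illustration under stated assumptions: the category labels, the record layout (`id`, `categories`), and the function names are hypothetical and do not come from the study's Excel data set.

```python
# Hypothetical category labels, paraphrasing the answer categories of this study.
ALL_CATEGORIES = [
    "cure", "symptom", "diagnosis", "oriental medicine", "religion",
    "management", "food", "etc", "advertisement",
]

# Categories treated as non-information in the first filtering step (assumed set).
NON_INFORMATION = {"advertisement", "religion", "food"}


def tag_answer(categories, all_categories):
    """Return a 1/0 indicator per category, mirroring the '1'/'0' marking."""
    return {c: int(c in categories) for c in all_categories}


def first_filter(answers, excluded=NON_INFORMATION):
    """Keep only answers that carry none of the excluded categories."""
    return [a for a in answers if not (set(a["categories"]) & set(excluded))]


# Made-up records, for illustration only.
answers = [
    {"id": 1, "categories": ["diagnosis", "cure"]},
    {"id": 2, "categories": ["advertisement"]},
    {"id": 3, "categories": ["symptom", "food"]},
]

kept = first_filter(answers)
print([a["id"] for a in kept])
```

Passing a different `excluded` set (for example, one that also contains "oriental medicine") changes which answers survive, so the set should match whichever category list is adopted.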
2.2.3. Data preprocessing:
Using Python, the Korean morpheme analyzer 'Kkma' (Kokoma), which is tolerant of spacing errors and typos, was used to extract the keywords of the 472 answers. As a result, 7,445 keywords were extracted, and items matching the criteria below were then refined out.
· Meaningless numbers
· Names
· Typing errors
· One-character nouns whose meaning is unknown
· Adverbs, adjectives, and verbs incorrectly classified as nouns
· Media or group names output as separate pieces (the noun used together is taken instead)
Numbers were removed first. After this pre-processing, 6,718 keywords remained. After all six refinement steps, 4,180 keywords remained. The figure below illustrates the nouns extracted by the morpheme analyzer.
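The mechanical part of this refinement can be sketched as below. It assumes the nouns have already been extracted by a morpheme analyzer, and it covers only the criteria that can be checked automatically (numbers, one-character nouns, and a hand-made stoplist for names and typos). The function name, the stoplist, and the allow-list are hypothetical.

```python
import re


def refine_keywords(nouns, stoplist=frozenset(), single_char_allow=frozenset()):
    """Drop numbers, unknown one-character nouns, and stoplisted words."""
    refined = []
    for word in nouns:
        w = word.strip()
        if not w:
            continue  # empty token
        if re.fullmatch(r"[\d.,]+", w):
            continue  # meaningless number
        if len(w) == 1 and w not in single_char_allow:
            continue  # one-character noun with no known meaning
        if w in stoplist:
            continue  # name or typo listed by hand
        refined.append(w)
    return refined


# Made-up input: 기침 (cough), 암 (cancer), 가래 (phlegm), a year, a filler noun, a name.
nouns = ["기침", "2015", "암", "것", "홍길동", "가래"]
print(refine_keywords(nouns, stoplist={"홍길동"}, single_char_allow={"암"}))
```

Criteria that need human judgment, such as misclassified parts of speech or split group names, are not automated here and would still require manual review.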
2.2.4. Category library Development(2nd filtering):
The Value Chain of Michael Porter is a model that classifies a firm's value-creating activities for customers into primary and support activities and thereby analyzes competitive advantage to find the process that creates added value. Figure 2 illustrates how the primary activities in the value chain are applied to disease stages so that knowledge consumers can recognize a disease stage and obtain proper information.
Using Porter's value chain, this study classified the processes of disease occurrence and treatment by stage and newly defined them as the primary activities of a disease. As shown in Table 4, these primary activities were defined in order to classify keyword categories.
| Categories (number of words) | Definition of Categories |
|---|---|
| Symptom (105) | Phenomena that appear before a disease is diagnosed |
| Cure (334) | Operations, treatments, or medicines prescribed by a hospital |
| Diagnosis (479) | Diseases, departments that diagnose diseases, doctors, and hospitals |
| Management (89) | Activities (e.g., exercise) that help in dealing with a disease outside hospital care |
| ETC. (3172) | ETC. |
2.2.5. Keyword frequency Extraction:
The morpheme analyzer was used for keyword extraction and refinement, after which the keywords were classified according to the category criteria above. Using 'Eclipse', a freely distributed general-purpose software development platform, keyword frequencies were saved in a .csv file. As shown in the figure, for the first of the 472 answers, the frequency of each word in the symptom keyword list was saved in a .csv file. In the first answer, the word 'cough' appeared four times.
In this way, the main words used for the primary activities were found, and empirical analysis of the health-related answers was conducted. For each of the 472 answers, the words in each category of treatment, symptoms, diagnosis, management, and others were counted and totaled. These totals were used as new variables in further analysis.
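The per-answer counting described above can be sketched in Python as follows. The study itself used Eclipse, so this is not the original implementation. The dictionaries, tokens, and column names are made up for illustration, with 'cough' appearing four times as in the first answer.

```python
import csv
import io
from collections import Counter


def category_frequencies(tokens, dictionaries):
    """Total, per category, how often the category's dictionary words occur."""
    counts = Counter(tokens)
    return {cat: sum(counts[w] for w in words) for cat, words in dictionaries.items()}


def write_frequency_csv(rows, categories, out):
    """Write one line per answer: answer id followed by category totals."""
    writer = csv.writer(out)
    writer.writerow(["answer_id", *categories])
    for answer_id, freq in rows:
        writer.writerow([answer_id, *(freq[c] for c in categories)])


# Hypothetical two-category dictionary and one tokenized answer.
dictionaries = {"symptom": {"cough", "phlegm"}, "cure": {"surgery"}}
tokens = ["cough", "cough", "surgery", "cough", "phlegm", "cough"]

freq = category_frequencies(tokens, dictionaries)
buffer = io.StringIO()  # stands in for a .csv file on disk
write_frequency_csv([(1, freq)], ["symptom", "cure"], buffer)
print(buffer.getvalue())
```

Replacing `io.StringIO()` with `open("frequencies.csv", "w", newline="")` would write the same rows to a file. The file name is an assumption.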
2.2.6. Content Analysis:
The typical definition of content analysis was proposed by Berelson, who defined it as ‘a research technique for the objective, systematic and quantitative description of the manifest content of communication’. Holsti defines content analysis as ‘any technique for making inferences by objectively and systematically identifying specified characteristics of messages’. Large-scale content analysis requires a lot of effort and time to produce results. With the development of computers, however, it has been reevaluated as a versatile analysis technique.
The researchers classified the vocabulary used in health-related answers on online Q&A information sharing communities by category and analyzed its characteristics through descriptive statistics. For answers related to a specific disease that can be used as medical information, this study empirically analyzed the types of information offered. Using the word as the unit of content analysis, this study measured how often each category of words actually used by experts or non-experts appears in the medical answers.
2.2.7. Document classification and information filtering:
Today, a great deal of information is generated by a variety of Internet-based digital media in addition to books and newspapers. It has therefore become necessary to search documents efficiently and to develop classification techniques for this huge volume of information. The Naive Bayes classifier is one of the most widely used techniques for this purpose.
This study used the category keywords of each answer, obtained by the research method described above, to calculate the frequencies of the primary activities of treatment, symptoms, diagnosis, management, and others and applied them as information filtering variables. The total keyword frequencies were calculated as shown in the example of Table 5, and a new data set was established.
The number of words differs from one keyword category dictionary to another. As a result, the more words a category dictionary contains, the larger its total frequency tends to be. To correct for this, each frequency was divided by the number of words in the corresponding dictionary. The formula is as follows:
$$\text{Normalized frequency} = \frac{f_1}{Cwf_2} \qquad (1)$$
$Cwf_2$: total number of words in the keyword category library
$f_1$: frequency of the keywords of each category
The result of the formula was used to build the data set shown in Table 6 below. As an illustration, 10 records are shown. The numbers next to the primary activity variables were assigned arbitrarily in the pre-processing step: treatment (1), symptoms (2), diagnosis (3), management (8), and others (10). If the category with the highest value is included in one of number categories 1, 2, 3, and 4, the consistency variable is defined as ‘O’; otherwise, it is defined as ‘X’. In the first record, diagnosis (3) had the highest value but was not included in any of the four number categories. Therefore, it was defined as ‘X’.
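A minimal sketch of the normalization and the consistency check is given below. It reads ‘O’ as a match between the top-scoring category and the categories tagged for the answer, which is the reading consistent with the 283 ‘O’ answers later reported as 60% accuracy. The dictionary sizes come from Table 4, while the function names and the example frequencies are hypothetical.

```python
# Dictionary sizes per category, taken from Table 4.
DICT_SIZES = {"symptom": 105, "cure": 334, "diagnosis": 479, "management": 89, "etc": 3172}


def normalized_scores(freq, dict_sizes):
    """Divide each category frequency by the size of that category's dictionary."""
    return {cat: freq[cat] / dict_sizes[cat] for cat in freq}


def consistency(freq, dict_sizes, tagged_categories):
    """Return the top-scoring category and 'O' if it is among the tagged ones, else 'X'."""
    scores = normalized_scores(freq, dict_sizes)
    top = max(scores, key=scores.get)
    return top, ("O" if top in tagged_categories else "X")


# Made-up keyword frequencies for a single answer.
freq = {"cure": 3, "symptom": 5, "diagnosis": 4, "management": 0, "etc": 20}

print(consistency(freq, DICT_SIZES, {"symptom", "cure"}))
print(consistency(freq, DICT_SIZES, {"diagnosis"}))
```

In this example the raw count is highest for the others category (20), but after dividing by dictionary size the symptom category scores highest, which is the effect the normalization is meant to achieve.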
3. RESULTS AND DISCUSSION:
3.1. Content Analysis:
3.1.1. Primary Activities keywords classification and analysis:
As a result of classifying keywords by dictionary, diagnosis had the most keywords, 479, excluding the others category. This indicates that the information obtained by knowledge consumers is mainly related to diseases and particular hospitals.
A keyword dictionary was established and a list of words by category was created. The total frequency of the words in each keyword category was calculated, and the words were sorted by frequency.
The keywords of the five categories were listed in descending order of frequency, and the top 20 words were extracted as shown in Table 7 below.
Taken together, the words appearing most often in the 472 answers were analyzed by frequency, and the range of vocabulary used by experts and non-experts was examined. In the treatment category of the primary activities, anticancer treatment, biopsy, and radiation treatment were frequent as treatment names. Because questions asked about a particular disease and how it is treated, words for the treatment methods of that disease were frequent in the answers. In the symptoms category, cough, phlegm, and hemoptysis were frequent as symptoms of lung cancer, since questions about lung cancer drew many symptom-related answers. In the diagnosis category, which includes disease names and treatment departments, cancer, lung cancer, and the departments that diagnose cancer, such as pulmonology and internal medicine, were frequent. The most noticeable word in the management category was nucleic acid, a kind of nutritional supplement that was often recommended for lung cancer patients. Health insurance and natural healing also drew a lot of interest. In the others category, words for body regions were frequent. In the future, the categories will need to be specified further to supplement the dictionary categories.
3.2. Information filtering:
Document classification is the technique of assigning continually incoming documents to the most appropriate categories. Naïve Bayes, one of the typical machine learning methods, is known to be simple yet accurate in prediction, so it has been applied to many document classification projects (Jae-young Jang, 2006). To compare document classification performance, this study ran Naïve Bayes in Weka, the free data mining program developed by the team led by Prof. Ian Witten at the University of Waikato in New Zealand. The program is useful because it offers various algorithms and visual analysis functions. For the performance comparison, this study compared the method using the consistency variable with Naïve Bayes.
For Naïve Bayes, the number category was designated as the target variable and the percentage split was set to 80%. As a result, the classification accuracy was 46%, as shown in Figure 3 below.
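For readers who want to reproduce this kind of baseline outside Weka, the sketch below shows a small multinomial Naive Bayes with an 80% percentage split, written with the Python standard library only. It is not Weka's NaiveBayes implementation and will not reproduce the 46% figure. The feature layout (keyword counts per category), the labels, and all data are made up for illustration.

```python
import math
import random
from collections import defaultdict


def train_nb(X, y, alpha=1.0):
    """Fit a multinomial Naive Bayes on count features with Laplace smoothing."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, value in enumerate(row):
            feature_sums[label][j] += value
    model = {}
    for label, n in class_counts.items():
        total = sum(feature_sums[label]) + alpha * n_features
        log_prior = math.log(n / len(y))
        log_like = [math.log((s + alpha) / total) for s in feature_sums[label]]
        model[label] = (log_prior, log_like)
    return model


def predict_nb(model, row):
    """Return the label with the highest log posterior for one count vector."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(v * ll for v, ll in zip(row, log_like))
    return max(model, key=score)


def percentage_split(X, y, train_fraction=0.8, seed=0):
    """Shuffle the records and split them into training and test parts."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def accuracy(model, X, y):
    """Share of records whose predicted label equals the true label."""
    return sum(predict_nb(model, r) == t for r, t in zip(X, y)) / len(y)


# Made-up keyword counts per answer: [cure, symptom, diagnosis].
X = [[5, 0, 1], [4, 1, 0], [0, 6, 1], [1, 5, 0], [6, 1, 1], [0, 4, 2]]
y = ["cure", "cure", "symptom", "symptom", "cure", "symptom"]

model = train_nb(X, y)
print(predict_nb(model, [7, 0, 0]), predict_nb(model, [0, 7, 0]))
```

With real data, `percentage_split` would be called first, the model trained on the 80% part, and `accuracy` computed on the remaining 20%.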
This study defined the consistency variable by checking whether the value of ‘frequency of words in each answer / total number of words in the category dictionary’ led to the proper category. Of the total 472 answers, 283 were marked ‘O’ and 189 were marked ‘X’, giving a classification accuracy of 60%. Therefore, the method proposed in this study performed better than the traditional document classification technique.
Using the frequency of words in an answer, this study proposed an algorithm with 60% accuracy that can classify answers containing information related to the primary activities. Further efforts will be needed to improve its performance.
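As a quick check of the reported figure, the accuracy of the consistency-based method follows directly from the two counts given for the 472 answers:

$$\text{accuracy} = \frac{283}{283 + 189} = \frac{283}{472} \approx 0.5996 \approx 60\%$$

This is about 14 percentage points above the 46% obtained with Naïve Bayes.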
4. CONCLUSION:
With the advent of the Internet and Web 2.0, various platforms have arisen, and prior research and experiments have made it clear that knowledge consumers' decision making becomes increasingly difficult because of the amount of information generated on them. Recognizing the importance of providing consumers with accurate and necessary health-related information, we therefore analyzed answers from a text perspective. In the online knowledge sharing community, answers related to a specific disease were first tagged according to nine categories selected with the help of medical staff. Answers in the management category were then re-classified into food and management, and the 472 answers remaining after excluding those classified as advertisement, religion, and food were analyzed further. When the keywords of the text were extracted and classified according to the five primary-activity categories of the disease value chain, the diagnosis category had the most keywords, 479, excluding the etc. category.
We proposed a method that uses the dictionary constructed through this analysis to classify lung cancer related answers and provide information related to the primary activities. We hope that it will help solve the problem of information overload and lay the groundwork for an automatic document classification system.
5. ACKNOWLEDGMENT:
This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2016-H8501-16-1013) supervised by the IITP (Institute for Information and communication Technology Promotion).
This work was supported by the intramural research grant of Chungbuk National University in 2015.
6. REFERENCES:
1. Schultz, U. and Vandenbosch, B., Information overload in a groupware environment: Now you see it, now you don't, Journal of Organizational Computing and Electronic Commerce, 1998, 8(2), pp.127-148.
2. Choi Hang Sup, Theoretical Study on the meaning of Collective Intelligence of Pierre Levy, Cyber communication Academic Society, 2009, 26(3), pp.287-322. http://libproxy.chungbuk.ac.kr/90a6552/_Lib_Proxy_Url/www.riss.kr/search/detail/DetailView.do?p_mat_type=1a0202e37d52c72d&control_no=b60646c9b6445533ffe0bdc3ef48d419
3. Stenger, I., Le Défi de la production de l’intelligence collective, Multitude, 2005, 20, pp.117-124.
4. Gwon Soon Chan, Effects of knowledge search service on the knowledge formation of high school students were studied, Kongju National University, 2009, pp1~59. http://libproxy.chungbuk.ac.kr/90a6552/_Lib_Proxy_Url/www.riss.kr/search/detail/DetailView.do?p_mat_type=be54d9b8bc7cdb09&control_no=90e48ba697b8d319ffe0bdc3ef48d419
5. Kim Tae Yun, Validation of answering credibility in online healthcare Q&A community, Chungbuk National University, 2016, 50, pp1~50. http://libproxy.chungbuk.ac.kr/90a6552/_Lib_Proxy_Url/www.riss.kr/search/detail/DetailView.do?p_mat_type=be54d9b8bc7cdb09&control_no=8bb611c0ed368fbdffe0bdc3ef48d419
6. Park Jae Chon, Shin Ji Woong, Research on how to utilize collective intelligence on Web 2.0 platform, KSII Transactions on Internet and Information Systems, 2007, 8(2), pp15-20. http://www.riss.kr/search/detail/DetailView.do?p_mat_type=1a0202e37d52c72d&control_no=b45e966ecef64ea8ffe0bdc3ef48d419#redirect
7. Levy, P., L’intelligence collective, pour une anthropologie du cyberespace, Paris: La Découverte, 1994.
8. Lee Jong Chel, Oh Jin A, A Study on Analysis for Accuracy of Knowledge and Information Related to Society and History on Internet Knowledge Search Service Program: For Situation of NAVER KNOWLEDGE iN and DAUM KNOWLEDGE, Asian journal for public opinion research : AJPOR, 2014, 15(2), pp.149-186. http://libproxy.chungbuk.ac.kr/90a6552/_Lib_Proxy_Url/www.riss.kr/search/detail/DetailView.do?p_mat_type=1a0202e37d52c72d&control_no=8d6e065bf97018827f7a54760bb41745
9. Kim Sung Min, Qualitative Research on the Collective Intelligence of Internet Users, 2007, Chung-Ang University. http://libproxy.chungbuk.ac.kr/90a6552/_Lib_Proxy_Url/www.riss.kr/search/detail/DetailView.do?p_mat_type=be54d9b8bc7cdb09&control_no=a817ad3c9797f817ffe0bdc3ef48d419
10. Lee Hwan Soo, Lim Dong Won, Zo Hang Jung, Personal Information Overload and User Resistance in the Big Data Age, Korea Intelligent Information System Society, 2013, 19(1), pp125-139. http://libproxy.chungbuk.ac.kr/90a6552/_Lib_Proxy_Url/www.riss.kr/search/detail/DetailView.do?p_mat_type=1a0202e37d52c72d&control_no=809473975d6ee248ffe0bdc3ef48d419
11. Kum G.T., Choi H.J., A Study of the impact of information overload on emotional and physical health, Korean Political Communication Association, 2010, 16(1), pp.5~32.
12. Park Joo-Bum , Jeong Dong-Youl, An Empirical Study of Web - based Question - Answer Services, Korea Society for Information Management, 2004, 21(3), pp.83-98. http://libproxy.chungbuk.ac.kr/90a6552/_Lib_Proxy_Url/www.riss.kr/search/detail/DetailView.do?p_mat_type=1a0202e37d52c72d&control_no=60c97425dd76ed72ffe0bdc3ef48d419#redirect
13. Park Sun Jin, Analysis on the Questions and the Answers' Contents related to Scientific Knowledge of Elementary School in Internet Knowledge-based Search Engine, NAVER, Gwangju National University of Education, 2008, pp.1-80. http://libproxy.chungbuk.ac.kr/90a6552/_Lib_Proxy_Url/www.riss.kr/search/detail/DetailView.do?p_mat_type=be54d9b8bc7cdb09&control_no=93766e22a8d233e9ffe0bdc3ef48d419#redirect
Received on 22.06.2017 Modified on 18.07.2017
Accepted on 21.07.2017 © RJPT All rights reserved
Research J. Pharm. and Tech. 2017; 10(7): 2313-2321.
DOI: 10.5958/0974-360X.2017.00410.3